209 research outputs found

    States versus Rewards: Dissociable Neural Prediction Error Signals Underlying Model-Based and Model-Free Reinforcement Learning

    Reinforcement learning (RL) uses sequential experience with situations (“states”) and outcomes to assess actions. Whereas model-free RL uses this experience directly, in the form of a reward prediction error (RPE), model-based RL uses it indirectly, building a model of the state transition and outcome structure of the environment and evaluating actions by searching this model. A state prediction error (SPE) plays a central role, reporting discrepancies between the current model and the observed state transitions. Using functional magnetic resonance imaging in humans solving a probabilistic Markov decision task, we found the neural signature of an SPE in the intraparietal sulcus and lateral prefrontal cortex, in addition to the previously well-characterized RPE in the ventral striatum. This finding supports the existence of two distinct forms of learning signal in humans, which may form the basis of distinct computational strategies for guiding behavior.
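    The two learning signals described above can be sketched in a few lines. This is a minimal illustration, not the paper's analysis code: the function names, learning rates, and tabular representations are assumptions.

```python
import numpy as np

def model_free_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """Model-free TD(0): a reward prediction error (RPE) drives value updates."""
    rpe = r + gamma * V[s_next] - V[s]   # discrepancy between received and expected reward
    V[s] += alpha * rpe
    return rpe

def model_based_update(T, s, a, s_next, alpha=0.1):
    """Model-based learning: a state prediction error (SPE) updates the transition model."""
    spe = 1.0 - T[s, a, s_next]          # surprise about the observed transition
    T[s, a] *= (1 - alpha)               # decay all transition estimates for (s, a)...
    T[s, a, s_next] += alpha             # ...and shift probability mass to the observed state
    return spe
```

    The RPE compares outcomes against cached values, while the SPE compares observed transitions against the learned world model; the update preserves each transition row as a probability distribution.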

    Anterior Prefrontal Cortex Contributes to Action Selection through Tracking of Recent Reward Trends

    The functions of prefrontal cortex remain enigmatic, especially for its anterior sectors, putatively ranging from planning to self-initiated behavior, social cognition, task switching, and memory. A predominant current theory regarding the most anterior sector, the frontopolar cortex (FPC), is that it is involved in exploring alternative courses of action, but the detailed causal mechanisms remain unknown. Here we investigated this issue using the lesion method, together with a novel model-based analysis. Eight patients with anterior prefrontal brain lesions, including the FPC, performed a four-armed bandit task known from neuroimaging studies to activate the FPC. Model-based analyses of learning demonstrated a selective deficit in the ability to extrapolate the most recent trend, despite an intact general ability to learn from past rewards. Whereas both brain-damaged and healthy controls used comparisons between the two most recent choice outcomes to infer trends that influenced their decision about the next choice, the group with anterior prefrontal lesions showed a complete absence of this component and instead based their choice entirely on the cumulative reward history. Given that the FPC is thought to be the most evolutionarily recent expansion of primate prefrontal cortex, we suggest that its function may reflect uniquely human adaptations to select and update models of reward contingency in dynamic environments.
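    The dissociation reported above, between cumulative reward history and extrapolation of the latest trend, can be caricatured as two additive value components. This toy decomposition and its weighting parameter are hypothetical, not the study's fitted model.

```python
def predicted_value(outcomes, w_trend=0.5):
    """Toy value model: cumulative reward history plus recent-trend extrapolation.

    Healthy controls behave as if w_trend > 0; the anterior prefrontal lesion
    group behaves as if w_trend = 0 (history-only). Weights are illustrative.
    """
    history = sum(outcomes) / len(outcomes)                            # cumulative reward history
    trend = outcomes[-1] - outcomes[-2] if len(outcomes) >= 2 else 0.0 # latest outcome comparison
    return history + w_trend * trend
```

    Setting `w_trend` to zero reproduces a chooser driven purely by accumulated rewards, the pattern the lesion group showed.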

    Independent neural computation of value from other people's confidence

    Expectation of reward can be shaped by observing the actions and expressions of other people in one's environment. A person's apparent confidence in the likely reward of an action, for instance, makes qualities of their evidence, not observed directly, socially accessible. This strategy is computationally distinguished from associative learning methods that rely on direct observation by its use of inference from indirect evidence. In twenty-three healthy human subjects, we isolated the effects of first-hand experience, other people's choices, and the mediating effect of their confidence on decision-making and on neural correlates of value within ventromedial prefrontal cortex (vmPFC). Value derived from first-hand experience and from other people's choices (regardless of confidence) was indiscriminately represented across vmPFC. However, value computed from agents' choices weighted by their associated confidence was represented with specificity for ventromedial area 10. This pattern corresponds to shifts of connectivity and overlapping cognitive processes along a posterior-anterior vmPFC axis. Task behavior correlated with self-reported self-reliance in decision-making in other social contexts. The tendency to conform in other social contexts corresponded to increased activation in cortical regions previously shown to respond to social conflict in proportion to subsequent conformity (Campbell-Meiklejohn et al., 2010). The tendency to self-monitor predicted a selectively enhanced response to accordance with others in the right temporoparietal junction (rTPJ). The findings anatomically decompose vmPFC value representations according to computational requirements and provide biological insight into the social transmission of preference and the reassurance gained from the confidence of others. Significance Statement: Decades of research have provided evidence that the ventromedial prefrontal cortex (vmPFC) signals the satisfaction we expect from imminent actions. However, we have a surprisingly modest understanding of the organization of value across this substantial and varied region. This study finds that using cues of the reliability of other people's knowledge to enhance expectation of personal success generates value correlates that are anatomically distinct from those concurrently computed from direct, personal experience. This suggests that the representation of decision values in vmPFC is suborganized according to the underlying computation, consistent with what we know about the anatomical heterogeneity of the region. These results also provide insight into the observational learning process by which someone else's confidence can sway and reassure our choices.

    How cognitive and reactive fear circuits optimize escape decisions in humans

    Flight initiation distance (FID), the distance at which an organism flees from an approaching threat, is an ecological metric of cost–benefit functions of escape decisions. We adapted the FID paradigm to investigate how fast- or slow-attacking “virtual predators” constrain escape decisions. We show that rapid escape decisions rely on “reactive fear” circuits in the periaqueductal gray and midcingulate cortex (MCC), while protracted escape decisions, defined by larger buffer zones, are associated with “cognitive fear” circuits, which include the posterior cingulate cortex, hippocampus, and ventromedial prefrontal cortex, circuits implicated in more complex information processing, cognitive avoidance strategies, and behavioral flexibility. Using a Bayesian decision-making model, we further show that optimization of escape decisions under rapid flight was localized to the MCC, a region involved in adaptive motor control, whereas the hippocampus was implicated in optimizing decisions that update and control slower escape initiation. These results demonstrate an unexplored link between defensive survival circuits and their role in adaptive escape decisions.

    Model-based learning protects against forming habits.

    Studies in humans and rodents have suggested that behavior can at times be "goal-directed" (that is, planned and purposeful) and at times "habitual" (that is, inflexible and automatically evoked by stimuli). This distinction is central to conceptions of pathological compulsion, as in drug abuse and obsessive-compulsive disorder. Evidence for the distinction has primarily come from outcome-devaluation studies, in which the sensitivity of a previously learned behavior to motivational change is used to assay the dominance of habits versus goal-directed actions. However, little is known about how habits and goal-directed control arise. Specifically, in the present study we sought to reveal the trial-by-trial dynamics of instrumental learning that would promote, and protect against, developing habits. In two complementary experiments with independent samples, participants completed a sequential decision task that dissociated two computational learning mechanisms, model-based and model-free. We then tested for habits by devaluing one of the rewards that had reinforced behavior. In each case, we found that individual differences in model-based learning predicted the participants' subsequent sensitivity to outcome devaluation, suggesting that an associative mechanism underlies a bias toward habit formation in healthy individuals. This work was funded by a Sir Henry Wellcome Postdoctoral Fellowship (101521/Z/12/Z) awarded to C.M.G. ND is supported by a Scholar Award from the McDonnell Foundation. The authors report no conflicts of interest and declare no competing financial interests. This is the final published version. It first appeared at http://link.springer.com/article/10.3758%2Fs13415-015-0347-6
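    A common way to formalize the balance between the two mechanisms dissociated above is a weighted mixture of model-based and model-free action values passed through a softmax. This is a hedged sketch of that standard analysis style, not the study's actual fitting code; the weight `w`, inverse temperature `beta`, and values are illustrative.

```python
import numpy as np

def choice_probs(q_mb, q_mf, w, beta=3.0):
    """Softmax choice over a weighted mix of model-based and model-free values.

    w near 1: planning dominates (devaluation-sensitive behavior);
    w near 0: cached, habit-like values dominate.
    """
    q_net = w * np.asarray(q_mb, float) + (1 - w) * np.asarray(q_mf, float)
    e = np.exp(beta * (q_net - q_net.max()))   # subtract max for numerical stability
    return e / e.sum()
```

    Individual differences in the fitted `w` are then related to devaluation sensitivity: a higher model-based weight predicts behavior that adjusts when an outcome loses value.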

    Model-based control can give rise to devaluation-insensitive choice

    Influential recent work aims to ground psychiatric dysfunction in the brain's basic computational mechanisms. For instance, the compulsive symptoms that feature prominently in drug abuse and addiction have been argued to arise from overreliance on a habitual “model-free” system, in contrast to a more laborious “model-based” system. Support for this account comes in part from failures to appropriately change behavior in light of new events. Notably, instrumental responding can, in some circumstances, persist despite reinforcer devaluation, perhaps reflecting control by model-free mechanisms that are driven by past reinforcement rather than by knowledge of the (now devalued) outcome. However, another line of theory posits a different mechanism, latent cause inference, that can modulate behavioral change. It concerns how animals identify the different contingencies that apply in different circumstances by covertly clustering experiences into distinct groups. Here we combine both lines of theory to investigate the consequences of latent cause inference for instrumental sensitivity to reinforcer devaluation. We show that instrumental insensitivity to reinforcer devaluation can arise in this theory even using only model-based planning, and does not require or imply any habitual, model-free component. These ersatz habits (like laboratory ones) emerge after overtraining, interact with contextual cues, and show preserved sensitivity to reinforcer devaluation on a separate consumption test, a standard control. Together, this work highlights the need for caution when using reinforcer devaluation procedures to rule in (or out) the contribution of different learning mechanisms and offers a new perspective on the neurocomputational substrates of drug abuse.

    Tonic Dopamine Modulates Exploitation of Reward Learning

    The impact of dopamine on adaptive behavior in a naturalistic environment is largely unexamined. Experimental work suggests that phasic dopamine is central to reinforcement learning, whereas tonic dopamine may modulate performance without altering learning per se; however, this idea has not been developed formally or integrated with computational models of dopamine function. We quantitatively evaluate the role of tonic dopamine in these functions by studying the behavior of hyperdopaminergic DAT-knockdown mice in an instrumental task in a semi-naturalistic homecage environment. In this “closed economy” paradigm, subjects earn all of their food by pressing either of two levers, but the relative cost of food on each lever shifts frequently. Compared with wild-type mice, hyperdopaminergic mice allocate more lever presses to high-cost levers, thus working harder to earn a given amount of food and maintain their body weight. However, both groups react similarly quickly to shifts in lever cost, suggesting that the hyperdopaminergic mice are not slower at detecting changes, as would be expected with a learning deficit. We fit the lever-choice data using reinforcement learning models to assess the distinction between acquisition and expression that the models formalize. In these analyses, hyperdopaminergic mice displayed normal learning from recent reward history but a diminished capacity to exploit this learning: a reduced coupling between choice and reward history. These data suggest that dopamine modulates the degree to which prior learning biases action selection and consequently alters the expression of learned, motivated behavior.
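    The acquisition/expression dissociation described above can be illustrated with a two-lever simulation in which the learning rate (acquisition) and the inverse temperature coupling values to choice (expression) are separate parameters. This is an illustrative sketch, not the study's model; all parameter values and names are assumptions.

```python
import numpy as np

def simulate(alpha, beta, p_reward, n_trials=500, seed=0):
    """Two-lever RL agent: alpha governs acquisition (learning from reward),
    beta governs expression (coupling of learned values to choice).
    Returns the fraction of choices of lever 1."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)
    picks = 0
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))  # softmax over two levers
        c = int(rng.random() < p1)
        r = float(rng.random() < p_reward[c])             # Bernoulli reward
        q[c] += alpha * (r - q[c])                        # identical update in both cases
        picks += c
    return picks / n_trials
```

    With `alpha` held fixed, lowering `beta` leaves value learning intact but weakens the link between learned values and choice, the pattern the hyperdopaminergic mice showed.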

    Humans decompose tasks by trading off utility and computational cost

    Human behavior emerges from planning over elaborate decompositions of tasks into goals, subgoals, and low-level actions. How are these decompositions created and used? Here, we propose and evaluate a normative framework for task decomposition based on the simple idea that people decompose tasks to reduce the overall cost of planning while maintaining task performance. Analyzing 11,117 distinct graph-structured planning tasks, we find that our framework justifies several existing heuristics for task decomposition and makes predictions that can be distinguished from two alternative normative accounts. We report a behavioral study of task decomposition (N = 806) that uses 30 randomly sampled graphs, a larger and more diverse set than that of any previous behavioral study on this topic. We find that human responses are more consistent with our framework for task decomposition than with the alternative normative accounts, and are most consistent with a heuristic, betweenness centrality, that is justified by our approach. Taken together, our results provide new theoretical insight into the computational principles underlying the intelligent structuring of goal-directed behavior.
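    The betweenness-centrality heuristic named above scores each node by the fraction of shortest paths that pass through it; high-scoring "bottleneck" states make natural subgoal candidates. The brute-force implementation and the toy graph below are illustrative (fine for small graphs), not the paper's pipeline.

```python
from collections import deque
from itertools import permutations

def shortest_paths(graph, s, t):
    """All shortest paths from s to t via BFS predecessor enumeration."""
    dist, preds, q = {s: 0}, {s: []}, deque([s])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v], preds[v] = dist[u] + 1, [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)      # another equally short route into v
    paths = []
    def walk(v, suffix):
        if v == s:
            paths.append([s] + suffix)
            return
        for p in preds[v]:
            walk(p, [v] + suffix)
    if t in dist:
        walk(t, [])
    return paths

def betweenness(graph):
    """Sum, over ordered node pairs, of the fraction of shortest paths
    passing through each intermediate node (endpoints excluded)."""
    bc = {v: 0.0 for v in graph}
    for s, t in permutations(graph, 2):
        paths = shortest_paths(graph, s, t)
        if not paths:
            continue
        for v in graph:
            if v not in (s, t):
                bc[v] += sum(v in p for p in paths) / len(paths)
    return bc
```

    On a graph of two clusters joined by a single node, that bridging node receives the highest score, matching the intuition that people place subgoals at bottleneck states.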